PyRAD: assembly of de novo RADseq loci for phylogenetic analyses
نویسنده
چکیده
MOTIVATION Restriction-site-associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. RESULTS PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic datasets. It uses a wrapper around an alignment-clustering algorithm, which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g. paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq dataset that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals. AVAILABILITY Software is written in Python and freely available at http://www.dereneaton.com/software/.
منابع مشابه
AftrRAD: a pipeline for accurate and efficient de novo assembly of RADseq data.
An increase in studies using restriction site-associated DNA sequencing (RADseq) methods has led to a need for both the development and assessment of novel bioinformatic tools that aid in the generation and analysis of these data. Here, we report the availability of AftrRAD, a bioinformatic pipeline that efficiently assembles and genotypes RADseq data, and outputs these data in various formats ...
متن کاملPhylogenomic inferences from reference-mapped and de novo assembled short-read sequence data using RADseq sequencing of California white oaks (Quercus section Quercus).
The emergence of next generation sequencing has increased by several orders of magnitude the amount of data available for phylogenetics. Reduced representation approaches, such as restriction-sited associated DNA sequencing (RADseq), have proven useful for phylogenetic studies of non-model species at a wide range of phylogenetic depths. However, analysis of these datasets is not uniform and we ...
متن کاملRADseq dataset with 90% missing data fully resolves recent radiation of Petalidium (Acanthaceae) in the ultra‐arid deserts of Namibia
Deserts, even those at tropical latitudes, often have strikingly low levels of plant diversity, particularly within genera. One remarkable exception to this pattern is the genus Petalidium (Acanthaceae), in which 37 of 40 named species occupy one of the driest environments on Earth, the Namib Desert of Namibia and neighboring Angola. To contribute to understanding this enigmatic diversity, we g...
متن کاملGenotyping-by-Sequencing in a Species Complex of Australian Hummock Grasses (Triodia): Methodological Insights and Phylogenetic Resolution
Next-generation sequencing is becoming increasingly accessible to researchers asking biosystematic questions, but current best practice in both choosing a specific approach and effectively analysing the resulting data set is still being explored. We present a case study for the use of genotyping-by-sequencing (GBS) to resolve relationships in a species complex of Australian arid and semi-arid g...
متن کاملClustering of Short Read Sequences for de novo Transcriptome Assembly
Given the importance of transcriptome analysis in various biological studies and considering thevast amount of whole transcriptome sequencing data, it seems necessary to develop analgorithm to assemble transcriptome data. In this study we propose an algorithm fortranscriptome assembly in the absence of a reference genome. First, the contiguous sequencesare generated using de Bruijn graph with d...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Bioinformatics
دوره 30 13 شماره
صفحات -
تاریخ انتشار 2014